Tuesday, February 06, 2018
R such as
R packages we will use in this presentation.R package.install.packages("rmarkdown") ## Intalling packages.
library("rmarkdown") ## Loading packages.
devtools, rmarkdown, knitr, checkpoint, pastecs, psych, magrittr, ggplot2, plotly, gapminder, stargazer, leaflet, DT, gvlma.
R interactive session is started from the R mini BootCamp.Rproj file, the main file for this workshop.ggplot2 package is installed from GitHub.devtools::install_github('hadley/ggplot2')
library("ggplot2")
R package which is capable of importing excel files.R, there are several packages for importing data from an excel file such as XLConnect, xlsx, and readxl.install.packages("XLConnect") ## Installing.
library("XLConnect") ## Loading.
Method 1
data <- readWorksheetFromFile(file = "file.path", ...)
Method 2
workbook <- loadWorkbook(filename = "file.path") data <- readWorksheet(object = workbook, sheet = "Sheet Name or Index", ...)
sheet and header arguments in both methods.
sheet argument specifies the sheet name or index you want to load in the excel file.header argument specifies whether the first row should be used as column names.?readWorksheet in console shows the other arguments you can use.Method 1
data <- readWorksheetFromFile(file = paste0(data.path, "wage1.xls"), sheet = "WAGE1", header = FALSE)
Method 2
workbook <- loadWorkbook(paste0(data.path, "wage1.xls")) data <- readWorksheet(workbook, sheet = "WAGE1", header = FALSE)
numeric class which is indicated with <dbl> sign in the tablewage: average hourly earnings (Col1)educ: years of education (Col2)exper: years potential experience (Col3)tenure: years with current employer (Col4)female: =1 if female (Col6)data <- data[, c(1:4, 6)] ## Subsetting with column index numbers.
# data <- data[, c("Col1", "Col2", "Col3", "Col4", "Col6")] ## Subsetting with column names.
Col1, Col2, …, Col24), we need to assign names to each variables.colnames(data) <- c("Wage", "Education", "Experience", "Tenure", "Gender") ## New names.
Gender variable to factor class where Male is the base level.Ln.Wage variable, natural logarithm of Wage.Experience.Sq variable, square of Experience.data$Gender <- factor(data$Gender, labels = c("Male", "Female")) ## "0" is Male, "1" is Female.
data$Ln.Wage <- log(data$Wage) ## Column binds as the last column.
data$Experience.Sq <- data$Experience^2 ## Column binds as the last column.
col.order <- c("Wage", "Ln.Wage", "Education", "Experience", "Experience.Sq", "Tenure", "Gender")
data <- data[, col.order]
R, missing values are represented with NA.is.na function to check whether a value is missing.sapply(data, function(missing) sum(is.na(missing))) ## Checks number of missing values for all columns.
#> Wage Ln.Wage Education Experience Experience.Sq #> 0 0 0 0 0 #> Tenure Gender #> 0 0
sum(sapply(data, function(missing) sum(is.na(missing)))) ## Checks number of missing values in all columns.
#> [1] 0
str(data)
#> 'data.frame': 526 obs. of 7 variables: #> $ Wage : num 3.1 3.24 3 6 5.3 ... #> $ Ln.Wage : num 1.13 1.18 1.1 1.79 1.67 ... #> $ Education : num 11 12 11 8 12 16 18 12 12 17 ... #> $ Experience : num 2 22 2 44 7 9 15 5 26 22 ... #> $ Experience.Sq: num 4 484 4 1936 49 ... #> $ Tenure : num 0 2 0 28 2 8 7 3 4 21 ... #> $ Gender : Factor w/ 2 levels "Male","Female": 2 2 1 1 1 1 1 2 2 1 ...
dim, nrow, ncol, rownames, colnames, head, and tail functions to explore the structure of the data.# R code chunk is not evaluated. dim(data); nrow(data); ncol(data) rownames(data); colnames(data) head(data); tail(data)
R functions, the descriptive statistics of the manipulated data can be acquired.Wage variable.# R code chunk is not evaluated. min(data$Wage); max(data$Wage) mean(data$Wage); median(data$Wage) var(data$Wage); sd(data$Wage) range(data$Wage) skewness(data$Wage); kurtosis(data$Wage) ## moments and e1071 packages.
sapply can be used to get a descriptive statistics of all variables.sapply(data, function(col) mean(col))
#> Wage Ln.Wage Education Experience Experience.Sq #> 5.896103 1.623268 12.562738 17.017110 473.435361 #> Tenure Gender #> 5.104563 NA
summary(data)
#> Wage Ln.Wage Education Experience #> Min. : 0.530 Min. :-0.6349 Min. : 0.00 Min. : 1.00 #> 1st Qu.: 3.330 1st Qu.: 1.2030 1st Qu.:12.00 1st Qu.: 5.00 #> Median : 4.650 Median : 1.5369 Median :12.00 Median :13.50 #> Mean : 5.896 Mean : 1.6233 Mean :12.56 Mean :17.02 #> 3rd Qu.: 6.880 3rd Qu.: 1.9286 3rd Qu.:14.00 3rd Qu.:26.00 #> Max. :24.980 Max. : 3.2181 Max. :18.00 Max. :51.00 #> Experience.Sq Tenure Gender #> Min. : 1.0 Min. : 0.000 Male :274 #> 1st Qu.: 25.0 1st Qu.: 0.000 Female:252 #> Median : 182.5 Median : 2.000 #> Mean : 473.4 Mean : 5.105 #> 3rd Qu.: 676.0 3rd Qu.: 7.000 #> Max. :2601.0 Max. :44.000
knitr::kable(stat.desc(data)[-c(1:3, 7, 10:11, 14), ], align = "c", digits = 3)
| Wage | Ln.Wage | Education | Experience | Experience.Sq | Tenure | Gender | |
|---|---|---|---|---|---|---|---|
| min | 0.530 | -0.635 | 0.000 | 1.000 | 1.000 | 0.000 | NA |
| max | 24.980 | 3.218 | 18.000 | 51.000 | 2601.000 | 44.000 | NA |
| range | 24.450 | 3.853 | 18.000 | 50.000 | 2600.000 | 44.000 | NA |
| median | 4.650 | 1.537 | 12.000 | 13.500 | 182.500 | 2.000 | NA |
| mean | 5.896 | 1.623 | 12.563 | 17.017 | 473.435 | 5.105 | NA |
| var | 13.639 | 0.283 | 7.667 | 184.204 | 379511.161 | 52.193 | NA |
| std.dev | 3.693 | 0.532 | 2.769 | 13.572 | 616.045 | 7.224 | NA |
knitr::kable(as.data.frame(describe(data))[, -1], align = "c", digits = 3)
| n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wage | 526 | 5.896 | 3.693 | 4.650 | 5.244 | 2.372 | 0.530 | 24.980 | 24.450 | 2.002 | 4.940 | 0.161 |
| Ln.Wage | 526 | 1.623 | 0.532 | 1.537 | 1.589 | 0.531 | -0.635 | 3.218 | 3.853 | 0.390 | 0.374 | 0.023 |
| Education | 526 | 12.563 | 2.769 | 12.000 | 12.692 | 1.483 | 0.000 | 18.000 | 18.000 | -0.618 | 1.866 | 0.121 |
| Experience | 526 | 17.017 | 13.572 | 13.500 | 15.640 | 14.085 | 1.000 | 51.000 | 50.000 | 0.705 | -0.652 | 0.592 |
| Experience.Sq | 526 | 473.435 | 616.045 | 182.500 | 352.673 | 264.644 | 1.000 | 2601.000 | 2600.000 | 1.488 | 1.280 | 26.861 |
| Tenure | 526 | 5.105 | 7.224 | 2.000 | 3.493 | 2.965 | 0.000 | 44.000 | 44.000 | 2.104 | 4.629 | 0.315 |
| Gender* | 526 | 1.479 | 0.500 | 1.000 | 1.474 | 0.000 | 1.000 | 2.000 | 1.000 | 0.083 | -1.997 | 0.022 |
stargazer(data, type = "text", flip = TRUE, mean.sd = TRUE, min.max = TRUE,
iqr = TRUE, median = TRUE) ## Use 'type = 'html'' for HTML output.
#> #> ================================================================== #> Statistic Wage Ln.Wage Education Experience Experience.Sq Tenure #> ------------------------------------------------------------------ #> N 526 526 526 526 526 526 #> Mean 5.896 1.623 12.563 17.017 473.435 5.105 #> St. Dev. 3.693 0.532 2.769 13.572 616.045 7.224 #> Min 0.530 -0.635 0 1 1 0 #> Pctl(25) 3.330 1.203 12 5 25 0 #> Median 4.650 1.537 12 13.5 182.5 2 #> Pctl(75) 6.880 1.929 14 26 676 7 #> Max 24.980 3.218 18 51 2,601 44 #> ------------------------------------------------------------------
hist(data$Wage)
hist(data$Wage, col = "lightblue", breaks = 50, xlim = c(0, max(data$Wage)),
ylim = c(0, 80), xlab = "Wage", main = "Histogram of Wage")
rug(data$Wage) ## Plots all of the points under the histogram.
abline(v = mean(data$Wage), col = "red", lwd = 1, lty = 2) ## Vertical line.
hist(data$Wage, col = "red", breaks = 100, xlim = c(0, max(data$Wage)), ylim = c(0,
0.7), freq = FALSE, density = 10, xlab = "Wage", main = "Density of Wage")
barplot(table(data$Education), col = "wheat", xlab = "Education", ylab = "Number of Employees",
main = "Number of Employees by Education")
barplot(table(data$Gender, data$Education), col = c("blue", "red"), xlab = "Education",
ylab = "Number of Employees", main = "Number of Employees by Gender and Education")
legend("topright", pch = 15, col = c("blue", "red"), legend = c("Male", "Female"))
boxplot(data$Wage ~ data$Gender, col = c("blue", "red"), names = c("Male", "Female"),
ylab = "Wage", main = "Box Plots of Wage for Male and Female")
plot(x = data$Education, y = data$Wage, axes = FALSE, col = "red", bg = "green",
cex = 0.75, pch = 21, xlab = "Education", ylab = "Wage", main = "Relationship Between Wage and Education")
axis(side = 1, at = c(0:max(data$Education)))
axis(side = 2)
box()
plot(data[, c("Wage", "Education", "Experience")], cex = 0.75)
R data frame called AirPassengers, which shows the monthly airline passenger numbers 1949-1960.plot(AirPassengers, type = "l", ylab = "# of Passangers")
#> We recommend that you use the dev version of ggplot2 with `ggplotly()`
#> Install it with: `devtools::install_github('hadley/ggplot2')`
R, you need to use lm function for linear regressions.model <- lm(formula = data$Wage ~ data$Education, singular.ok = FALSE) ## Sometimes useful. model <- lm(data = data, formula = Wage ~ Education, singular.ok = FALSE) ## Better for printing.
# R code chunk is not evaluated. attributes(model) ## Gives the model attributes for quick model estimation results. model$coefficients; coef(model) ## Coefficients. confint(model) ## Confidence interval of coefficients. model$residuals; residuals(model) ## Residuals. model$df.residual ## Degrees of freedom for residuals. model$fitted.values; fitted(model) ## Fitted values.
summary(model) ## Prints detailed model estimation results.
#> #> Call: #> lm(formula = Wage ~ Education, data = data, singular.ok = FALSE) #> #> Residuals: #> Min 1Q Median 3Q Max #> -5.3396 -2.1501 -0.9674 1.1921 16.6085 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -0.90485 0.68497 -1.321 0.187 #> Education 0.54136 0.05325 10.167 <2e-16 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 3.378 on 524 degrees of freedom #> Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632 #> F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16
# R code chunk is not evaluated. model.sum <- summary(model) str(model.sum) ## Structure of model summary object. model.sum$coefficients ## Coefficients. model.sum$r.squared ## R squared. model.sum$adj.r.squared ## Adjusted R squared. model.sum$sigma ## Sigma (residual standard error). model.sum$cov.unscaled ## Variance-covariance matrix. model.sum$df ## Degrees of freedoms model.sum$fstatistic ## F-statistics of the model.
stargazer(model, type = "text", style = "qje") ## For a nice tabulated printing, use stargazer package.
#> #> ========================================================== #> Wage #> ---------------------------------------------------------- #> Education 0.541*** #> (0.053) #> #> Constant -0.905 #> (0.685) #> #> N 526 #> R2 0.165 #> Adjusted R2 0.163 #> Residual Std. Error 3.378 (df = 524) #> F Statistic 103.363*** (df = 1; 524) #> ========================================================== #> Notes: ***Significant at the 1 percent level. #> **Significant at the 5 percent level. #> *Significant at the 10 percent level.
with(data, plot(Education, Wage, main = "Estimated Regression Line")) abline(model, col = "red")
ANOVA), you should use the anova function.print.data.frame(anova(model))
#> Df Sum Sq Mean Sq F value Pr(>F) #> Education 1 1179.732 1179.73205 103.3627 2.782597e-22 #> Residuals 524 5980.682 11.41352 NA NA
par(mfrow = c(2,2)) # Change the panel layout to 2 x 2. plot(model)
par(mfrow = c(1,1)) # Change back to 1 x 1.
R functions for model diagnosis are listed below.# R code chunk is not evaluated. plot(gvlma(model)) ## Model diagnosis plots (from gvlma package). summary(gvlma(model)) ## Checks all the linear regression assumptions (from gvlma package). gvlma(model) ## Similar to above but without model estimation results. vif(model) ## For Multicollinearity outlierTest(model) ## For Outliers. outlierTest(model) ## Bonferonni p-value for most extreme observations. qqPlot(model, main = "QQ Plot") # Normality of residuals (qq plot for studentized residuals). shapiro.test(residuals(model)) ## Normality check with Shapiro-Wilk test. ad.test(residuals(model)) ## Normality check with Anderson-Darling test. ncvTest(model) ## Non-constant error variance test (homoskedasticity).
+ sign to the formula argument in the lm function.model <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE) stargazer(model, type = "html", style = "qje") ## For a nice tabulated printing, use stargazer package.
| Wage | |
| Education | 0.599*** |
| (0.051) | |
| Experience | 0.022* |
| (0.012) | |
| Tenure | 0.169*** |
| (0.022) | |
| Constant | -2.873*** |
| (0.729) | |
| N | 526 |
| R2 | 0.306 |
| Adjusted R2 | 0.302 |
| Residual Std. Error | 3.084 (df = 522) |
| F Statistic | 76.873*** (df = 3; 522) |
| Notes: | ***Significant at the 1 percent level. |
| **Significant at the 5 percent level. | |
| *Significant at the 10 percent level. | |
base level of the dummy variable.
Male is the base level of the dummy variable Gender.base level of a dummy variable by re-defining the base level with ref argument in the relevel function, e.g., relevel(data$Gender, ref = "Female").model.1 <- lm(data = data, formula = Wage ~ Education + Experience + Gender,
singular.ok = FALSE) ## Base level: Male.
data$Gender <- relevel(data$Gender, ref = "Female")
model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Gender,
singular.ok = FALSE) ## Base level: Female.
data$Gender <- relevel(data$Gender, ref = "Male") ## Revert the base level back to Male.
stargazer(model.1, model.2, type = "html", style = "all2")
| Dependent variable: | ||
| Wage | ||
| (1) | (2) | |
| Education | 0.603*** | 0.603*** |
| (0.051) | (0.051) | |
| Experience | 0.064*** | 0.064*** |
| (0.010) | (0.010) | |
| GenderFemale | -2.156*** | |
| (0.270) | ||
| GenderMale | 2.156*** | |
| (0.270) | ||
| Constant | -1.734** | -3.890*** |
| (0.754) | (0.727) | |
| Observations | 526 | 526 |
| R2 | 0.309 | 0.309 |
| Adjusted R2 | 0.305 | 0.305 |
| Residual Std. Error (df = 522) | 3.078 | 3.078 |
| F Statistic (df = 3; 522) | 77.920*** (p = 0.000) | 77.920*** (p = 0.000) |
| Note: | p<0.1; p<0.05; p<0.01 | |
model.1 <- lm(data = data, formula = Wage ~ Experience + Experience^2) ## Not correct. model.2 <- lm(data = data, formula = Wage ~ Experience + I(data$Experience^2)) ## Correct. model.3 <- lm(data = data, formula = Wage ~ Experience + Experience.Sq) ## Correct.
| Dependent variable: | |||
| Wage | |||
| (1) | (2) | (3) | |
| Experience | 0.031*** | 0.298*** | 0.298*** |
| (0.012) | (0.041) | (0.041) | |
| Experience2) | -0.006*** | ||
| (0.001) | |||
| Experience.Sq | -0.006*** | ||
| (0.001) | |||
| Constant | 5.373*** | 3.725*** | 3.725*** |
| (0.257) | (0.346) | (0.346) | |
| Observations | 526 | 526 | 526 |
| R2 | 0.013 | 0.093 | 0.093 |
| Adjusted R2 | 0.011 | 0.089 | 0.089 |
| Residual Std. Error | 3.673 (df = 524) | 3.524 (df = 523) | 3.524 (df = 523) |
| F Statistic | 6.766*** (df = 1; 524) (p = 0.010) | 26.740*** (df = 2; 523) (p = 0.000) | 26.740*** (df = 2; 523) (p = 0.000) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
model.1 <- lm(data = data, formula = Ln.Wage ~ log(Experience), singular.ok = FALSE) model.2 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure, singular.ok = FALSE) model.3 <- lm(data = data, formula = Ln.Wage ~ Education + Experience + Tenure, singular.ok = FALSE) model.4 <- lm(data = data, formula = Wage ~ Education + Experience + Tenure + Gender, singular.ok = FALSE)
| Dependent variable: | ||||
| Ln.Wage | Wage | Ln.Wage | Wage | |
| (1) | (2) | (3) | (4) | |
| log(Experience) | 0.117*** | |||
| (0.021) | ||||
| Education | 0.599*** | 0.092*** | 0.572*** | |
| (0.051) | (0.007) | (0.049) | ||
| Experience | 0.022* | 0.004** | 0.025** | |
| (0.012) | (0.002) | (0.012) | ||
| Tenure | 0.169*** | 0.022*** | 0.141*** | |
| (0.022) | (0.003) | (0.021) | ||
| GenderFemale | -1.811*** | |||
| (0.265) | ||||
| Constant | 1.343*** | -2.873*** | 0.284*** | -1.568** |
| (0.056) | (0.729) | (0.104) | (0.725) | |
| Observations | 526 | 526 | 526 | 526 |
| R2 | 0.055 | 0.306 | 0.316 | 0.364 |
| Adjusted R2 | 0.053 | 0.302 | 0.312 | 0.359 |
| Residual Std. Error | 0.517 (df = 524) | 3.084 (df = 522) | 0.441 (df = 522) | 2.958 (df = 521) |
| F Statistic | 30.421*** (df = 1; 524) (p = 0.00000) | 76.873*** (df = 3; 522) (p = 0.000) | 80.391*** (df = 3; 522) (p = 0.000) | 74.398*** (df = 4; 521) (p = 0.000) |
| Note: | p<0.1; p<0.05; p<0.01 | |||